To evaluate the impact of Omitted Variable Bias (OVB) on different models, consider a system where oceanography drives site temperature and recruitment over time. Temperature also fluctuates over time within each site. Recruitment and temperature influence snail abundance, as do other uncorrelated drivers. You have conducted a study in this system measuring 10 sites, each site sampled once per year over 10 years. In this study, you have recorded snail abundance and temperature. But have no measure of recruitment. Note, the results of models will be the same if you had instead sampled in one year across 10 sites with 10 plots per site if there were plot-level drivers of temperature that behaved the same as below.
To parameterize our simulations, consider the following:
Thus, the system looks like this:
To simulated data, let’s begin by loading some libraries
library(tidyverse)
library(lme4)
library(broom)
library(broom.mixed)
library(DiagrammeR)
library(glue)
theme_set(theme_bw(base_size = 14))
Next, we need a function that will create a template of simulated sites based on oceanography and the sampling design described above.
make_environment <- function(n_sites = 10,
ocean_temp = 2,
temp_sd = 0,
ocean_recruitment = -2,
recruitment_sd = 0,
temp_mean = 15,
rec_mean = 10,
site_sd = 1,
seed = NULL){
if(!is.null(seed)) set.seed(seed)
tibble(
site = as.character(1:n_sites)) %>%
mutate(
oceanography = rnorm(n_sites),
site_temp = temp_mean +
rnorm(n_sites, ocean_temp * oceanography, temp_sd),
site_recruitment = rec_mean +
rnorm(n_sites, ocean_recruitment * oceanography, recruitment_sd),
site_int = rnorm(n_sites, 0, sd = site_sd)
)
}
Great. Now, we need to add that year-to-year or plot-to-plot variability.
make_plots <- function(sites_df,
n_plots_per_site = 10,
plot_temp_sd = 1,
temp_effect = 1,
recruitment_effect = 1,
sd_plot = 2,
seed = NULL){
if(!is.null(seed)) set.seed(seed)
sites_df %>%
rowwise() %>%
mutate(
plot_temp_dev_actual = list(rnorm(n_plots_per_site,
0, plot_temp_sd))) %>%
unnest(plot_temp_dev_actual) %>%
mutate(
plot_temp = site_temp + plot_temp_dev_actual,
snails = rnorm(n(),
temp_effect*plot_temp +
recruitment_effect*site_recruitment +
site_int,
sd_plot)) %>%
ungroup() %>%
group_by(site) %>%
mutate(year = 1:n(),
site_mean_temp = mean(plot_temp),
plot_temp_dev = plot_temp - site_mean_temp,
site_mean_snail = mean(snails),
site_snail_dev = snails - mean(snails),
delta_snails = snails - lag(snails),
delta_temp = plot_temp - lag(plot_temp)) %>%
ungroup()
}
To analyze our data, we will compare several different fit models.
analyze_plots <- function(plot_df){
m <- tribble(
~model_type, ~fit,
"Naive", lm(snails ~ plot_temp, data = plot_df),
"RE", lmer(snails ~ plot_temp + (1|site), data = plot_df),
"FE with Dummy Variables", lm(snails ~ plot_temp + site, data = plot_df),
"FE Using Mean Differencing", lm(snails ~ plot_temp + site, data = plot_df), #fix SE?
"Group Mean Covariate", lmer(snails ~ plot_temp + site_mean_temp + (1|site), data = plot_df),
"Group Mean Centered", lmer(snails ~ plot_temp_dev + site_mean_temp + (1|site), data = plot_df),
"Group Mean Covariate, no RE", lm(snails ~ plot_temp + site_mean_temp, data = plot_df),
"Group Mean Centered, no RE", lm(snails ~ plot_temp_dev + site_mean_temp, data = plot_df),
"First Differences", lm(delta_snails ~ delta_temp,data = plot_df) #fix SE?
) %>%
mutate(coefs = map(fit, tidy), #get coefficients with broom
out_stats = map(fit, glance),
temp_effect = map(coefs, get_temp_coef),
model_type = fct_inorder(model_type))
m
}
get_temp_coef <- function(a_tidy_frame){
a_tidy_frame %>%
filter(term %in% c("plot_temp", "plot_temp_dev", "delta_temp")) %>%
select(estimate, std.error)
}
Let’s begin by setting up 100 replicate simulations.
set.seed(31415)
n_sims <- 100
envt <- tibble(
sims = 1:n_sims
) %>%
mutate(sites = map(sims, ~make_environment()))
Just for a sanity check, here’s the relationship between temperature and recruitment at the site level across all simulations.
ggplot(envt %>%
unnest(sites),
aes(x = site_temp, y = site_recruitment, group = sims)) +
geom_point(alpha = 0.4, color = 'grey') +
stat_smooth(method = "lm", size = 0.5, alpha = 0.5,
fill = NA, color = "black") +
labs(x = "Site Average Temperature",
y = "Site Recruitment of Individuals Per Plot")
Great! Now, let’s setup our sampling over time.
plots_df <- envt %>%
mutate(site_year = map(sites, make_plots))
And, again, a sanity check…
ggplot(plots_df %>%
unnest(site_year),
aes(x = plot_temp, y = snails, color = as.character(sims), group = sims)) +
stat_smooth(method = "lm", fill = "NA") +
guides(color = "none") +
labs(x = "Plot Temperature",
y = "Snails per Plot")
So we can see that in this setup, the snail-temperature relationship is positive in nearly all of the simulations. But, how positive is it? What would our coefficients show on average across simulations?
Let’s fit models to each set of data
analysis_df <- plots_df %>%
mutate(analysis = map(site_year, analyze_plots)) %>%
unnest(analysis)
And now let’s look at the distribution of coefficients that would describe the relationship between temperature and snails from each mode.
analysis_df %>%
unnest(temp_effect) %>%
ggplot(aes(y = fct_rev(model_type), x = estimate)) +
ggridges::stat_density_ridges() +
labs(y="", x = "Model Estimated Temperature Effect")
Eyeballing it, we can see of course the naive model is too low, as is the RE model. How bad is the bias for the RE model?
analysis_df %>%
unnest(temp_effect) %>%
group_by(`Model Type` = model_type) %>%
summarize(`Mean Estimate` = mean(estimate),
`SD Estimate` = sd(estimate)) %>%
knit_table
| Model Type | Mean Estimate | SD Estimate |
|---|---|---|
| Naive | 0.231 | 0.165 |
| RE | 0.640 | 0.232 |
| FE with Dummy Variables | 0.985 | 0.215 |
| FE Using Mean Differencing | 0.985 | 0.215 |
| Group Mean Covariate | 0.985 | 0.215 |
| Group Mean Centered | 0.985 | 0.215 |
| Group Mean Covariate, no RE | 0.985 | 0.215 |
| Group Mean Centered, no RE | 0.985 | 0.215 |
| First Differences | 0.971 | 0.259 |
The downward bias produced by poor model choice is clear. We can see it if we plot the coefficient minus one, which if unbiased should reveal a distribution centered on 0.
analysis_df %>%
unnest(temp_effect) %>%
ggplot(aes(y = fct_rev(model_type), x = estimate - 1)) +
ggridges::stat_density_ridges() +
labs(y="", x = "Bias")
We can also look at the distribution of, for each simulated data set, how different the RE model is from each other model.
analysis_df %>%
filter(model_type != "Naive") %>%
unnest(temp_effect) %>%
group_by(sims) %>%
mutate(diff_from_re = estimate - estimate[1]) %>%
ungroup() %>%
filter(model_type != "RE") %>%
ggplot(aes(y = fct_rev(model_type), x = diff_from_re)) +
ggridges::stat_density_ridges() +
labs(y="", x = "Difference from RE Model\nTemperature Effect")
Note, in all cases, we can see the effects of downward bias. This is clear, but, to put it in numbers -
analysis_df %>%
filter(model_type != "Naive") %>%
unnest(temp_effect) %>%
group_by(sims) %>%
mutate(diff_from_re = estimate - estimate[1]) %>%
ungroup() %>%
filter(model_type != "RE") %>%
group_by(model_type) %>%
summarize(`Mean Diff from RE` = mean(diff_from_re),
`SD Diff from RE` = sd(diff_from_re)) %>%
knit_table
| model_type | Mean Diff from RE | SD Diff from RE |
|---|---|---|
| FE with Dummy Variables | 0.346 | 0.138 |
| FE Using Mean Differencing | 0.346 | 0.138 |
| Group Mean Covariate | 0.346 | 0.138 |
| Group Mean Centered | 0.346 | 0.138 |
| Group Mean Covariate, no RE | 0.346 | 0.138 |
| Group Mean Centered, no RE | 0.346 | 0.138 |
| First Differences | 0.331 | 0.218 |
If we were doing straight hypothesis testing, how often would our estimate of the temperature coefficient either overlap 0 or not have 1 within its confidence interval?
analysis_df %>%
filter(model_type != "Naive") %>%
unnest(temp_effect) %>%
mutate(overlap_0 = (estimate - 2*std.error)<0,
overlap_1 = (estimate - 2*std.error)<1 &
(estimate + 2*std.error)>1) %>%
group_by(`Model Type` = model_type) %>%
summarize(`95% CI Contains 0` = sum(overlap_0)/n(),
`95% CI does Not Contain 1` = (n()-sum(overlap_1))/n()) %>%
knit_table
| Model Type | 95% CI Contains 0 | 95% CI does Not Contain 1 |
|---|---|---|
| RE | 0.08 | 0.54 |
| FE with Dummy Variables | 0.00 | 0.05 |
| FE Using Mean Differencing | 0.00 | 0.05 |
| Group Mean Covariate | 0.00 | 0.05 |
| Group Mean Centered | 0.00 | 0.05 |
| Group Mean Covariate, no RE | 0.01 | 0.04 |
| Group Mean Centered, no RE | 0.01 | 0.04 |
| First Differences | 0.01 | 0.12 |
Here we see the RE model is more likely to be subject to type II error. Further, it is far more likely than any other technique to not have the true coefficient value within 2 CI of its estimand.
This has been useful, but, if we want to automate the process for further exploration, let’s wrap the code above into a function.
make_sims_and_analyze <- function(n_sims = 100,
n_sites = 10,
ocean_temp = 2,
temp_sd = 0,
ocean_recruitment = -2,
recruitment_sd = 0,
temp_mean = 15,
rec_mean = 10,
site_sd = 1,
n_plots_per_site = 10,
plot_temp_sd = 1,
temp_effect = 1,
recruitment_effect = 1,
sd_plot = 2,
seed = NULL) {
#should we set a seed?
if (!is.null(seed))
set.seed(seed)
# make an envt data frame
out_df <- tibble(sims = 1:n_sims) %>%
mutate(
sites = map(
sims,
~make_environment(
n_sites = n_sites,
ocean_temp = ocean_temp,
temp_sd = temp_sd,
ocean_recruitment = ocean_recruitment,
recruitment_sd = recruitment_sd,
temp_mean = temp_mean,
rec_mean = rec_mean,
site_sd = site_sd)
)
) %>%
#now add plots
mutate(
site_year = map(
sites,
make_plots,
n_plots_per_site = n_plots_per_site,
plot_temp_sd = plot_temp_sd,
temp_effect = temp_effect,
recruitment_effect = recruitment_effect,
sd_plot = sd_plot
)
) %>%
#and analysis
mutate(analysis = map(site_year, analyze_plots)) %>%
unnest(analysis)
}
If we look at the correlated random effects models, what’s the RE?
analysis_df %>%
filter(model_type %in% c("Group Mean Covariate", "Group Mean Centered")) %>%
unnest(coefs) %>%
filter(group == "site") %>%
group_by(`Model Type` = model_type) %>%
summarize(Term = "Site Random Effect",
`Mean Site SD` = mean(estimate),
`SD in Site SD` = sd(estimate)) %>%
knit_table
| Model Type | Term | Mean Site SD | SD in Site SD |
|---|---|---|---|
| Group Mean Covariate | Site Random Effect | 0.977 | 0.438 |
| Group Mean Centered | Site Random Effect | 0.977 | 0.438 |
Both are the same, which makes sense given the formulation of the model. What if there was no additional site-level variation uncorrelated with temperature, though?
re_0_frame <- make_sims_and_analyze(site_sd = 0)
re_0_frame %>%
filter(model_type %in% c("Group Mean Covariate", "Group Mean Centered")) %>%
unnest(coefs) %>%
filter(group == "site") %>%
group_by(`Model Type` = model_type) %>%
summarize(Term = "Site Random Effect",
`Mean Site SD` = mean(estimate),
`SD in Site SD` = sd(estimate)) %>%
knit_table
| Model Type | Term | Mean Site SD | SD in Site SD |
|---|---|---|---|
| Group Mean Covariate | Site Random Effect | 0.273 | 0.261 |
| Group Mean Centered | Site Random Effect | 0.273 | 0.261 |
Both REs are the same - again - but both overlap 0. In the system we simulated, there is no uncorrelated site-level variability. It’s all at the site-year (or site-plot) level. We have indeed run these models with no site random effect. They produce the same answers for coefficients - both in this simulation with no site-level random effect as well as above when there was a site-level random effect.
# Without random site variation
re_0_frame %>%
filter(grepl("Group Mean", model_type)) %>%
unnest(temp_effect) %>%
group_by(`Model Type` = model_type) %>%
summarize(`Mean Temp Effect` = mean(estimate),
`SD Temp Effect` = sd(estimate)) %>%
knit_table
| Model Type | Mean Temp Effect | SD Temp Effect |
|---|---|---|
| Group Mean Covariate | 0.969 | 0.203 |
| Group Mean Centered | 0.969 | 0.203 |
| Group Mean Covariate, no RE | 0.969 | 0.203 |
| Group Mean Centered, no RE | 0.969 | 0.203 |
# With random site variation
analysis_df %>%
filter(grepl("Group Mean", model_type)) %>%
unnest(temp_effect) %>%
group_by(`Model Type` = model_type) %>%
summarize(`Mean Temp Effect` = mean(estimate),
`SD Temp Effect` = sd(estimate)) %>%
knit_table
| Model Type | Mean Temp Effect | SD Temp Effect |
|---|---|---|
| Group Mean Covariate | 0.985 | 0.215 |
| Group Mean Centered | 0.985 | 0.215 |
| Group Mean Covariate, no RE | 0.985 | 0.215 |
| Group Mean Centered, no RE | 0.985 | 0.215 |
So should we just not worry about a site-level random effect? Not necesarily. First, if we consider a site RE, we can look at residual variation due to both between site differences as well as residual replicate-level variation. This can be a useful analysis when attempting to tease apart variation that is versus is not correlated with a driver of interest. Second, while models without a site random effect can be fit and used, these models will be structurally incorrect - replicates are not IID, as there is correlated error within sites. Last, mixed models can also provide advantages when handling models with unbalanced data between sites (see below).
Practically, however, the difference comes in with respect to whether you are interested in between site variation or not. The residual inflates with no RE, as it is now the combination of between site residual and within site residual terms. This could make a difference for various statistical tests down the line as well, but in terms of parameter estimates, we are still estimating a clean causal effect.
One of the advantages to mixed modeling approaches is how they handle unbalanced data. What if, in the above example, we had lost samples from each site generating unbalanced data?
set.seed(31415)
#what do we keep?
years_to_keep <- round(runif(10, 3, 10)) %>%
imap_dfr( ~ tibble(site = as.character(.y),
year = sample(1:10, .x),
keep = T))
#a function for unbalancing
unbalance <- function(site_year_df, keep_df){
left_join(site_year_df, keep_df) %>%
filter(keep)
}
#apply keeping to each site-year
unbalanced_df <- analysis_df %>%
select(-model_type, -fit, -coefs, -out_stats, -temp_effect) %>%
mutate(site_year = map(site_year, unbalance,
keep_df = years_to_keep),
analysis = map(site_year, analyze_plots)) %>%
unnest(analysis)
How does this lack of balance affect estimation of causal coefficients?
unbalanced_df %>%
unnest(temp_effect) %>%
group_by(`Model Type` = model_type) %>%
summarize(`Mean Estimate` = mean(estimate),
`SD Estimate` = sd(estimate)) %>%
knit_table
| Model Type | Mean Estimate | SD Estimate |
|---|---|---|
| Naive | 0.245 | 0.195 |
| RE | 0.500 | 0.297 |
| FE with Dummy Variables | 0.990 | 0.281 |
| FE Using Mean Differencing | 0.990 | 0.281 |
| Group Mean Covariate | 0.986 | 0.281 |
| Group Mean Centered | 0.986 | 0.281 |
| Group Mean Covariate, no RE | 0.982 | 0.299 |
| Group Mean Centered, no RE | 0.982 | 0.299 |
| First Differences | 0.961 | 0.316 |
We can see that point estimation here is improved somewhat. Although this could vary. More telling would be an exploration of those group means in the mixed versus fixed models.
Apply cluster robust standard errors to models without random effects!!! To be expopounded upon later by Laura
library(fixest)
fixest_df <- analysis_df %>%
group_by(sims) %>%
slice(1L) %>%
ungroup() %>%
select(sims, sites, site_year) %>%
mutate(fixest_fit = map(site_year,
~feols(snails ~ plot_temp | site,
cluster = "site",
data = .x)),
first_diff_fit = map(site_year,
~feols(delta_snails ~ delta_temp,
cluster = "site",
data = .x))
) %>%
pivot_longer(cols = c(fixest_fit, first_diff_fit),
names_to = "model_type",
values_to = "fit") %>%
mutate(coefs = map(fit, tidy))